Case 1 Learning Objective 4

Authors

Lisa Levoir and Jeffrey Zhuohui Liang

Published

September 1, 2023

1 Analyzing medical students' scores

Background given in the case description: “The course lasts twelve weeks. Throughout the course, students are assessed in multiple ways, including weekly quizzes, slide exams, and essays. They also take an end of course exam that includes essay, short answer, and multiple-choice components. The final data has the average scores for those assessments. Students are required to take laboratory practical (gross anatomy, histology, pathology and neuroanatomy) exams which are averaged into the final grade. Students also take a National Board of Medical Examiners (NBME) standardized exam in each course. Theoretically, if they do well on these exams, they should do well in the course overall. All of the assessments have been calculated on a 100-point scale.”

1.1 Questions from Learning Objective 4

  • How should we define not pass / marginal pass / pass thresholds and criteria?
  • How do these thresholds compare to final exam scores?

1.2 Data

There are 92 students. Two students scored below 70 on the final exam, which falls under an immediate failing threshold. Three students scored between 70 and 80 on the final and could be considered candidates for further scrutiny.
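A minimal sketch of how these counts fall out of the thresholds (the `final` vector here is simulated to match the counts above, since the real `dt$final` is not reproduced in this document):

```r
# Hypothetical illustration of the threshold counts; `final` is a simulated
# stand-in for dt$final, constructed to match the counts quoted above.
final <- c(68, 68, 72, 75, 79, rep(85:100, length.out = 87))

fail_now <- sum(final < 70)                # below the immediate-fail line
marginal <- sum(final >= 70 & final < 80)  # band warranting further scrutiny

fail_now
marginal
```

On the real data the same two `sum()` calls against `dt$final` reproduce the counts reported above.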

Code
label(dt$quiz)     <- "Quiz score (mean weekly performance)"
label(dt$nbme)     <- "National Board of Medical Examiners score"
label(dt$ga)       <- "Gross anatomy (mean score)"
label(dt$slide)    <- "Slide exams score (mean)"
label(dt$part.c)   <- "Part C score"
label(dt$essay)    <- "Essay score (mean)"
label(dt$eob.exam) <- "End of Block (course term) exam"
label(dt$final)    <- "Final score"

tableby(~ quiz + nbme + ga + slide + part.c + essay + eob.exam + final,
        data = dt,
        control = tableby.control(numeric.stats = c("meansd", "median", "range"))) %>%
  summary() %>%
  kable()
Overall (N=92)
Quiz score (mean weekly performance)
   Mean (SD) 0.821 (0.069)
   Median 0.820
   Range 0.660 - 1.000
National Board of Medical Examiners score
   Mean (SD) 89.870 (5.452)
   Median 91.000
   Range 74.000 - 100.000
Gross anatomy (mean score)
   Mean (SD) 82.974 (9.887)
   Median 83.929
   Range 49.490 - 100.000
Slide exams score (mean)
   Mean (SD) 82.252 (10.005)
   Median 83.925
   Range 53.060 - 100.000
Part C score
   Mean (SD) 81.112 (8.700)
   Median 81.590
   Range 59.630 - 100.000
Essay score (mean)
   Mean (SD) 86.767 (5.418)
   Median 87.250
   Range 71.250 - 95.750
End of Block (course term) exam
   Mean (SD) 84.913 (6.833)
   Median 85.000
   Range 65.000 - 99.000
Final score
   Mean (SD) 88.457 (5.628)
   Median 88.500
   Range 68.000 - 100.000

Notes:

  • There is no missingness in the dataset.

  • Part C score is “like a catch-all exam if the knowledge can’t be obtained through their lab and essay assessments.”

  • Not included in our data (but included in the student evaluation) is the score for the laboratory practical which “has multiple assessment scores which are captured in the data such as the histology, pathology, etc. - which are not specifically named like that.”

    • We will disregard this for our purposes

1.2.1 Scores based on stratifying by passing the final exam at 70% threshold

Code
# Put quiz scores on the same 100-point scale as the other assessments
dt = dt %>% 
  mutate(quiz = 100*quiz)

tableby(pass~.,dt %>% 
          select(-id) %>% 
          mutate(pass = final>70),
        control = 
          tableby.control(
            numeric.stats = c("meansd","median","range"),
            digits=1
          )) %>% 
  summary() %>% 
  knitr::kable()
Fail (final ≤ 70; N=2) Pass (final > 70; N=90) Total (N=92) p value
Quiz score (mean weekly performance) 0.011
   Mean (SD) 70.0 (1.9) 82.4 (6.7) 82.1 (6.9)
   Median 70.0 82.3 82.0
   Range 68.7 - 71.3 66.0 - 100.0 66.0 - 100.0
National Board of Medical Examiners score < 0.001
   Mean (SD) 76.0 (2.8) 90.2 (5.1) 89.9 (5.5)
   Median 76.0 91.0 91.0
   Range 74.0 - 78.0 78.0 - 100.0 74.0 - 100.0
Gross anatomy (mean score) 0.002
   Mean (SD) 61.7 (17.3) 83.4 (9.3) 83.0 (9.9)
   Median 61.7 84.4 83.9
   Range 49.5 - 74.0 51.5 - 100.0 49.5 - 100.0
Slide exams score (mean) 0.001
   Mean (SD) 60.5 (2.5) 82.7 (9.6) 82.3 (10.0)
   Median 60.5 84.2 83.9
   Range 58.7 - 62.2 53.1 - 100.0 53.1 - 100.0
Part C score < 0.001
   Mean (SD) 61.4 (2.5) 81.5 (8.3) 81.1 (8.7)
   Median 61.4 82.0 81.6
   Range 59.6 - 63.2 64.2 - 100.0 59.6 - 100.0
Essay score (mean) < 0.001
   Mean (SD) 73.6 (3.4) 87.1 (5.1) 86.8 (5.4)
   Median 73.6 87.2 87.2
   Range 71.2 - 76.0 71.2 - 95.8 71.2 - 95.8
End of Block (course term) exam < 0.001
   Mean (SD) 66.5 (2.1) 85.3 (6.3) 84.9 (6.8)
   Median 66.5 85.0 85.0
   Range 65.0 - 68.0 69.0 - 99.0 65.0 - 99.0
Final score < 0.001
   Mean (SD) 68.0 (0.0) 88.9 (4.8) 88.5 (5.6)
   Median 68.0 89.0 88.5
   Range 68.0 - 68.0 78.0 - 100.0 68.0 - 100.0

Below is a pairs plot in which students are split into two groups by their final exam score: those above 80% (“pass”) and those at or below 80% (“almost fail”). As mentioned above, the lower-scoring students deserve more scrutiny: how did they perform on other assessments?

Code
set.seed(123123)
# PCA on the scaled assessment scores (excluding id and final); pc is reused below
pc = prcomp(dt %>% select(-id, -final) %>% mutate(across(everything(), ~ as.numeric(scale(.)))))

# create pairs plot, colored by an 80% threshold on the final
ggpairs(dt %>% select(-id),
        aes(color = ifelse(final > 80, "pass", "(almost)fail\n")),
        progress = FALSE) +
  labs(caption = "Stratified by final scores: (almost)fail is <=80 and pass is >80")

There appears to be fairly distinct separation between the score distributions of students who scored at most 80% on the final and those who exceeded that threshold. Looking at the rightmost column of the pairs plot, every assessment is strongly correlated with the final exam score (all correlations above 0.702).
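The correlation claim can be checked directly with `cor()`. A sketch on simulated stand-in data (the real analysis would pass `dt` with all assessment columns):

```r
# Sketch: correlation of each assessment with the final score.
# `dt_sim` is simulated data sharing a common "ability" component,
# not the real course data.
set.seed(42)
n    <- 92
base <- rnorm(n, mean = 85, sd = 6)   # shared ability component
dt_sim <- data.frame(
  quiz  = base + rnorm(n, 0, 3),
  nbme  = base + rnorm(n, 5, 3),
  final = base + rnorm(n, 3, 2)
)

# correlation of each assessment with the final exam score
round(cor(dt_sim[, c("quiz", "nbme")], dt_sim$final), 3)
```

Because each simulated column shares the `base` component, the correlations with `final` come out high, mirroring the pattern in the pairs plot.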

It is likely wise to stick with the precedent of a 70% pass threshold, since at that cutoff the distributions of all the other evaluations separate even more clearly, as shown by the density plots below:

Code
plots <- list()

# density of each assessment, colored by pass/fail at the 70% final threshold
for (i in 1:7) {
  plots[[i]] <- ggplot(dt, aes(x = .data[[names(dt)[i + 1]]])) +
    geom_density(aes(color = ifelse(final > 70, "pass", "fail"))) +
    theme(legend.position = "none")
}

plot_grid(plotlist = plots, ncol = 4)

However, there isn’t as much separation for gross anatomy (GA): perhaps one student did relatively better in gross anatomy even though their other scores, both during and outside the course, fall well below their classmates’.

Code
labels <- c("student a", "student b", "class mean")
# scores of the two failing students plus the class mean for each assessment
below70 <- rbind(dt[which(dt$final < 70), 2:9], colMeans(dt[, 2:9]))
below70 <- cbind(labels, below70)
below70 %>%
  mutate(across(where(is.numeric), ~ round(., digits = 0))) %>%
  kable()
labels quiz nbme ga slide part.c essay eob.exam final
student a 71 78 49 62 60 76 65 68
student b 69 74 74 59 63 71 68 68
class mean 82 90 83 82 81 87 85 88

If we instead stratify by who scored above a 90 on the final, for example, there is much more overlap between the distributions, so a cutoff this stringent does not separate out the students who are not performing well enough overall.

Code
plots <- list()

# density of each assessment, colored by a (too stringent) 90% final threshold
for (i in 1:7) {
  plots[[i]] <- ggplot(dt, aes(x = .data[[names(dt)[i + 1]]])) +
    geom_density(aes(color = ifelse(final > 90, "pass", "fail"))) +
    theme(legend.position = "none")
}

plot_grid(plotlist = plots, ncol = 4)

1.2.2 PCA and k-means clustering

We did further analysis with k-means clustering with four groups (corresponding to A, B, C, F) in an effort to identify clusters of students, based on their scores, that might justify a separation threshold for failing. As the plots show, however, we failed to identify any useful subgroup in the data. There are potential outlier students with lower performance, so we would like an evaluation method that captures these low-performing students.

Code
# k-means clusters (k = 4, one per letter grade band)
set.seed(123123)
cl = kmeans(dt %>% select(-id) %>% mutate(across(everything(), ~ as.numeric(scale(.)))),
            centers = 4)$cluster
dt %>%
  left_join(tibble(id = dt$id, cluster = as.factor(cl)), by = "id") %>%
  cbind(pc$x) %>%
  ggplot(aes(x = PC1, y = final, color = cluster)) +
  scale_color_calc() +
  geom_jitter()

Code
# PCA biplot colored by k-means cluster (autoplot.prcomp from ggfortify)
autoplot(pc, data = dt %>% mutate(cluster = as.factor(cl)), colour = "cluster")

1.3 Can we create a better metric?

What if we take a weighted average to calculate the overall score, where 40% comes from the NBME exam (since it is independent of the instructor’s coursework but correlated with student success) and 60% comes from the mean of:

  • quiz scores

  • gross anatomy

  • slide quiz scores

  • Part C

  • essay scores

  • the end of block exam

and set the students in the bottom 5% of this overall score to fail?

Code
# weighted overall score: 60% mean of course assessments, 40% NBME
overall = 
  0.6 * rowMeans(dt %>% select(-id, -final, -nbme)) +
  0.4 * dt$nbme

dt %>% select(-id) %>%
  mutate(overall = overall) %>%
  ggpairs(aes(color = ifelse(overall > quantile(overall, 0.05),
                             "pass", "fail")),
          progress = FALSE)

Code
# plot pass/fail under the new metric in PCA space
dt %>%
  mutate(overall = overall,
         pass = overall > quantile(overall, 0.05)) %>%
  cbind(pc$x) %>%
  ggplot(aes(x = PC1, y = PC2, color = pass)) +
  geom_jitter()

While the original design has a clear cut-off, this metric better reflects students’ overall performance across all fields. As the plots above show, those with a lower overall score tend to perform worse in most other arenas.